Photometric stereo recovers the surface normals of an object from multiple images with varying shading cues, i.e., by modeling the relationship between surface orientation and intensity at each pixel. Photometric stereo excels in its superior per-pixel resolution and fine reconstruction details. However, it is a complicated problem because of the non-linear relationship caused by non-Lambertian surface reflectance. Recently, various deep learning methods have shown powerful capabilities for photometric stereo on non-Lambertian surfaces. This paper provides a comprehensive review of existing deep learning-based calibrated photometric stereo methods. We first analyze these methods from different perspectives, including input processing, supervision, and network architecture. We then summarize the performance of deep learning photometric stereo models on the most widely used benchmark dataset, which demonstrates the advanced performance of deep learning-based photometric stereo methods. Finally, we give suggestions and propose future research trends based on the limitations of existing models.
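For background on the per-pixel relationship between orientation and intensity that the survey above refers to, the classical calibrated setting assumes a Lambertian surface, where intensity is I = rho * (n . l), so with three or more known light directions the scaled normal can be recovered by least squares. The sketch below is this classical baseline only (not any of the deep learning methods reviewed above); the array names and toy shapes are illustrative assumptions.

```python
import numpy as np

def lambertian_photometric_stereo(images, light_dirs):
    """Classical calibrated photometric stereo under the Lambertian model.

    images:     (m, h, w) array, m images of the scene under m known lights
    light_dirs: (m, 3) array of unit light directions (calibrated)
    Returns per-pixel unit normals (h, w, 3) and albedo (h, w).
    """
    m, h, w = images.shape
    I = images.reshape(m, -1)                                  # (m, h*w) stacked intensities
    # Solve L @ (rho * n) = I per pixel in the least-squares sense.
    G, _, _, _ = np.linalg.lstsq(light_dirs, I, rcond=None)    # (3, h*w)
    albedo = np.linalg.norm(G, axis=0)                         # rho = |rho * n|
    normals = G / np.maximum(albedo, 1e-8)                     # unit normals
    return normals.T.reshape(h, w, 3), albedo.reshape(h, w)

# Toy usage with random data (shapes only; real inputs come from calibrated captures).
imgs = np.random.rand(8, 16, 16)
lights = np.random.randn(8, 3)
lights /= np.linalg.norm(lights, axis=1, keepdims=True)
normals, rho = lambertian_photometric_stereo(imgs, lights)
```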
User videos shared on social media platforms usually suffer from degradations caused by unknown proprietary processing procedures, which means their visual quality is poorer than that of the originals. This paper presents a new general video restoration framework for restoring user videos shared on social media platforms. In contrast to most learning-based video restoration methods that perform end-to-end mapping, where feature extraction is largely treated as a black box in the sense that the role of the features is typically unknown, our new method, termed Video restOration through adapTive dEgradation Sensing (VOTES), introduces the concept of a degradation feature map (DFM) to explicitly guide the video restoration process. Specifically, for each video frame we first adaptively estimate its DFM to extract features representing how difficult its different regions are to restore. We then feed the DFM into a convolutional neural network (CNN) to compute hierarchical degradation features that modulate an end-to-end video restoration backbone network, explicitly directing more attention to regions that are potentially harder to restore, which in turn leads to enhanced restoration performance. We explain the design rationale of the VOTES framework and present extensive experimental results showing that the new VOTES method outperforms various state-of-the-art techniques both quantitatively and qualitatively. In addition, we contribute a large-scale real-world database of user videos shared on different social media platforms. The code and dataset are available at https://github.com/luohongming/votes.git
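The core mechanism described above, estimating a per-frame degradation feature map and using it to modulate a restoration backbone, can be illustrated with a minimal PyTorch sketch. The layer sizes, the scale-and-shift style of modulation, and all module names below are assumptions for illustration only, not the authors' actual VOTES architecture (see the linked repository for that).

```python
import torch
import torch.nn as nn

class DFMEstimator(nn.Module):
    """Toy degradation-feature-map estimator: frame -> per-pixel degradation features."""
    def __init__(self, in_ch=3, dfm_ch=16):
        super().__init__()
        self.net = nn.Sequential(
            nn.Conv2d(in_ch, 32, 3, padding=1), nn.ReLU(inplace=True),
            nn.Conv2d(32, dfm_ch, 3, padding=1),
        )
    def forward(self, frame):
        return self.net(frame)

class DFMModulatedBlock(nn.Module):
    """Backbone block whose features are scaled and shifted by the DFM features."""
    def __init__(self, feat_ch=64, dfm_ch=16):
        super().__init__()
        self.body = nn.Conv2d(feat_ch, feat_ch, 3, padding=1)
        self.to_scale = nn.Conv2d(dfm_ch, feat_ch, 1)
        self.to_shift = nn.Conv2d(dfm_ch, feat_ch, 1)
    def forward(self, feat, dfm_feat):
        scale = self.to_scale(dfm_feat)
        shift = self.to_shift(dfm_feat)
        return torch.relu(self.body(feat) * (1 + scale) + shift)

# Toy forward pass: one degraded frame guides the modulation of backbone features.
frame = torch.randn(1, 3, 64, 64)
feat = torch.randn(1, 64, 64, 64)      # features from an assumed restoration backbone
dfm = DFMEstimator()(frame)            # (1, 16, 64, 64) degradation features
out = DFMModulatedBlock()(feat, dfm)   # degradation-aware features
```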
In recent years, deep-learning-based models have achieved remarkable performance in video super-resolution (VSR), but most of these models are not applicable to online video applications. These methods only consider distortion quality and ignore crucial requirements of online applications, such as low latency and low model complexity. In this paper, we focus on online video transmission, where VSR algorithms are required to generate high-resolution video sequences in real time. To address such challenges, we propose a new kernel-knowledge-transfer method, called Convolutional Kernel Bypass Grafts (CKBG). First, we design a lightweight network structure that does not require future frames as input, saving the extra time cost of caching these frames. Then, our proposed CKBG method enhances this lightweight base model by bypassing the original network with "kernel grafts", which are extra convolutional kernels containing the prior knowledge of external pre-trained image SR models. In the testing phase, we further accelerate the grafted multi-branch network by converting it into a simple single-path structure. Experimental results show that our proposed method can process online video sequences at up to 110 FPS with very low model complexity and competitive SR performance.
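The bypass-and-merge idea can be illustrated with a small structural re-parameterization sketch in PyTorch: a base 3x3 convolution plus a parallel "graft" 3x3 convolution (initialized randomly here; in the paper the graft carries knowledge from a pre-trained image SR model) are fused into a single convolution for inference by summing their kernels and biases. This is a generic re-parameterization illustration under those assumptions, not the exact CKBG design.

```python
import torch
import torch.nn as nn

class GraftedConv(nn.Module):
    """Training-time block: base conv bypassed by a parallel 'graft' conv (outputs summed)."""
    def __init__(self, ch=32):
        super().__init__()
        self.base = nn.Conv2d(ch, ch, 3, padding=1)
        self.graft = nn.Conv2d(ch, ch, 3, padding=1)   # would hold pre-trained SR kernels

    def forward(self, x):
        return self.base(x) + self.graft(x)

    def merge(self):
        """Fuse the two parallel branches into one conv for fast single-path inference."""
        fused = nn.Conv2d(self.base.in_channels, self.base.out_channels, 3, padding=1)
        with torch.no_grad():
            fused.weight.copy_(self.base.weight + self.graft.weight)
            fused.bias.copy_(self.base.bias + self.graft.bias)
        return fused

x = torch.randn(1, 32, 64, 64)
block = GraftedConv()
y_train = block(x)             # multi-branch output used during training
y_test = block.merge()(x)      # identical output from the fused single-path conv
assert torch.allclose(y_train, y_test, atol=1e-4)
```

Because convolution is linear, summing the outputs of two same-shaped convolutions equals a single convolution with summed weights and biases, which is why the fused single-path model matches the multi-branch one exactly (up to floating-point error) while being faster at test time.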
High dynamic range (HDR) imaging is a fundamental problem in image processing that aims to produce well-exposed images even when the illumination varies across the scene. In recent years, multi-exposure fusion methods, which merge multiple low dynamic range (LDR) images with different exposures to generate the corresponding HDR image, have achieved remarkable results. However, synthesizing HDR images in dynamic scenes is still challenging and in high demand. There are two challenges in producing HDR images: 1) object motion between LDR images can easily cause undesirable ghosting artifacts in the generated results; 2) under-exposed and over-exposed regions often contain distorted image content because of insufficient compensation for these regions in the merging stage. In this paper, we propose a multi-scale sampling and aggregation network for HDR imaging in dynamic scenes. To effectively alleviate the problems caused by both small and large motions, our method implicitly aligns the LDR images in a coarse-to-fine manner. Furthermore, we propose a densely connected network based on the discrete wavelet transform to improve performance; it decomposes the input into several non-overlapping frequency sub-bands and adaptively performs compensation in the wavelet domain. Experiments show that our proposed method achieves state-of-the-art performance across different scenes compared with other promising HDR imaging methods. In addition, the HDR images generated by our method contain cleaner and more detailed content with fewer distortions, leading to better visual quality.
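The wavelet-domain step above, splitting an input into non-overlapping frequency sub-bands so that each can be compensated separately, can be illustrated with a single-level 2-D Haar DWT in NumPy. This is a generic DWT sketch under the assumption of a Haar filter and a toy 8x8 input, not the paper's densely connected network.

```python
import numpy as np

def haar_dwt2(x):
    """Single-level 2-D Haar DWT: splits an image into LL, LH, HL, HH sub-bands."""
    a = x[0::2, 0::2]; b = x[0::2, 1::2]
    c = x[1::2, 0::2]; d = x[1::2, 1::2]
    ll = (a + b + c + d) / 2.0      # low-frequency approximation
    lh = (a + b - c - d) / 2.0      # horizontal detail
    hl = (a - b + c - d) / 2.0      # vertical detail
    hh = (a - b - c + d) / 2.0      # diagonal detail
    return ll, lh, hl, hh

def haar_idwt2(ll, lh, hl, hh):
    """Inverse single-level Haar DWT (perfect reconstruction)."""
    h, w = ll.shape
    x = np.zeros((2 * h, 2 * w), dtype=ll.dtype)
    x[0::2, 0::2] = (ll + lh + hl + hh) / 2.0
    x[0::2, 1::2] = (ll + lh - hl - hh) / 2.0
    x[1::2, 0::2] = (ll - lh + hl - hh) / 2.0
    x[1::2, 1::2] = (ll - lh - hl + hh) / 2.0
    return x

img = np.random.rand(8, 8)
subbands = haar_dwt2(img)            # four non-overlapping frequency sub-bands
# ... per-sub-band compensation would be applied here ...
recon = haar_idwt2(*subbands)
assert np.allclose(recon, img)       # the decomposition is invertible
```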
Lung cancer is one of the deadliest cancers, and in part its effective diagnosis and treatment depend on the accurate delineation of the tumor. Human-centered segmentation, which is currently the most common approach, is subject to inter-observer variability and is also time-consuming, considering the fact that only experts are able to provide annotations. Automatic and semi-automatic tumor segmentation methods have recently shown promising results. However, as different researchers have validated their algorithms using various datasets and performance metrics, reliably evaluating these methods is still an open challenge. The goal of the Lung-Originated Tumor Segmentation from Computed Tomography Scan (LOTUS) benchmark, created through the 2018 IEEE Video and Image Processing (VIP) Cup competition, is to provide a unique dataset and pre-defined metrics so that different researchers can develop and evaluate their methods in a unified fashion. The 2018 VIP Cup began with global engagement from 42 countries to access the competition data. At the registration stage, 129 members formed 28 teams from 10 countries, of which 9 teams made it to the final stage and 6 teams successfully completed all the required tasks. In a nutshell, all the algorithms proposed during the competition were based on deep learning models combined with false-positive reduction techniques. The methods developed by the three finalists show promising tumor segmentation results, although more effort should be put into reducing the false-positive rate. This competition manuscript presents an overview of the VIP Cup challenge, along with the proposed algorithms and results.
Multi-task learning is an effective learning strategy for deep-learning-based facial expression recognition tasks. However, most existing methods pay limited attention to feature selection when transferring information between different tasks, which may lead to task interference when training the multi-task network. To address this problem, we propose a novel selective feature-sharing method and establish a multi-task network for facial expression recognition and facial expression synthesis. The proposed method can effectively transfer beneficial features between different tasks while filtering out useless and harmful information. Furthermore, we employ the facial expression synthesis task to enlarge and balance the training dataset, further enhancing the generalization ability of the proposed method. Experimental results show that the proposed method achieves state-of-the-art performance on commonly used facial expression recognition benchmarks, which makes it a potential solution for real-world facial expression recognition problems.
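The abstract does not spell out how beneficial features are shared and harmful ones filtered, so the PyTorch sketch below shows only one generic way such selective cross-task transfer is often realized: a learned gate that decides how much of the source-task feature is added to the target-task feature. The module name, the gating form, and all sizes are assumptions, not the paper's mechanism.

```python
import torch
import torch.nn as nn

class GatedFeatureShare(nn.Module):
    """Generic gated transfer of features from one task branch to another.

    A learned gate in (0, 1) controls how much of the source-task feature is
    added to the target-task feature, so unhelpful features can be suppressed
    instead of being shared wholesale.
    """
    def __init__(self, ch=64):
        super().__init__()
        self.gate = nn.Sequential(nn.Conv2d(2 * ch, ch, 1), nn.Sigmoid())

    def forward(self, target_feat, source_feat):
        g = self.gate(torch.cat([target_feat, source_feat], dim=1))  # per-channel, per-pixel gate
        return target_feat + g * source_feat                          # selectively shared feature

# Toy usage: share synthesis-branch features into the recognition branch.
rec_feat = torch.randn(1, 64, 28, 28)
syn_feat = torch.randn(1, 64, 28, 28)
shared = GatedFeatureShare()(rec_feat, syn_feat)
```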
In this paper, we propose a novel framework dubbed peer learning to deal with the problem of biased scene graph generation (SGG). This framework uses predicate sampling and consensus voting (PSCV) to encourage different peers to learn from each other, improving model diversity and mitigating bias in SGG. To address the heavily long-tailed distribution of predicate classes, we propose to use predicate sampling to divide and conquer this issue. As a result, the model is less biased and makes more balanced predicate predictions. Specifically, one peer may not be sufficiently diverse to discriminate between different levels of predicate distributions. Therefore, we sample the data distribution based on the frequency of predicates into sub-distributions, selecting head, body, and tail classes to combine and feed to different peers as complementary predicate knowledge during the training process. The complementary predicate knowledge of these peers is then ensembled using a consensus voting strategy, which simulates a civilized voting process in our society that emphasizes the majority opinion and diminishes the minority opinion. This approach ensures that the learned representations of each peer are optimally adapted to the various data distributions. Extensive experiments on the Visual Genome dataset demonstrate that PSCV outperforms previous methods. We establish a new state-of-the-art (SOTA) on the SGCls task, achieving a mean score of \textbf{31.6}.
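The two named ingredients, frequency-based splitting of predicate classes into head/body/tail groups and a majority-style consensus vote over the peers, can be sketched in NumPy as below. The split fractions, the toy class counts, and the plain majority-vote rule are illustrative assumptions, not the exact PSCV procedure.

```python
import numpy as np

def split_predicates_by_frequency(pred_counts, head_frac=0.3, body_frac=0.3):
    """Split predicate class ids into head/body/tail groups by sample frequency."""
    order = np.argsort(pred_counts)[::-1]          # most frequent classes first
    n = len(order)
    head = order[: int(n * head_frac)]
    body = order[int(n * head_frac): int(n * (head_frac + body_frac))]
    tail = order[int(n * (head_frac + body_frac)):]
    return head, body, tail

def consensus_vote(peer_predictions):
    """Majority vote over per-peer predicted predicate labels for each sample."""
    peer_predictions = np.asarray(peer_predictions)          # (num_peers, num_samples)
    vote = [np.bincount(col).argmax() for col in peer_predictions.T]
    return np.array(vote)

# Toy example: 10 predicate classes with a long-tailed count distribution.
counts = np.array([500, 300, 120, 60, 30, 15, 8, 4, 2, 1])
head, body, tail = split_predicates_by_frequency(counts)
# Three peers disagree on some samples; the consensus keeps the majority opinion.
preds = [[0, 3, 7], [0, 3, 9], [1, 3, 7]]
print(consensus_vote(preds))   # -> [0 3 7]
```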
Audio-visual scene understanding is a challenging problem due to the unstructured spatial-temporal relations that exist in the audio signals and the spatial layouts of different objects and various texture patterns in the visual images. Recently, many studies have focused on abstracting features from convolutional neural networks, while the learning of explicit, semantically relevant frames of sound signals and visual images has been overlooked. To this end, we present an end-to-end framework, namely the attentional graph convolutional network (AGCN), for structure-aware audio-visual scene representation. First, the spectrogram of the sound and the input image are processed by a backbone network for feature extraction. Then, to build multi-scale hierarchical information of the input features, we utilize an attention fusion mechanism to aggregate features from multiple layers of the backbone network. Notably, to well represent the salient regions and contextual information of the audio-visual inputs, the salient acoustic graph (SAG) and contextual acoustic graph (CAG), as well as the salient visual graph (SVG) and contextual visual graph (CVG), are constructed for the audio-visual scene representation. Finally, the constructed graphs are passed through a graph convolutional network for structure-aware audio-visual scene recognition. Extensive experimental results on audio, visual, and audio-visual scene recognition datasets show that the AGCN method achieves promising results. Visualizations of the graphs on spectrograms and images are presented to show the effectiveness of the proposed CAG/SAG and CVG/SVG, which focus on salient and semantically relevant regions.
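For readers unfamiliar with the final stage above, a basic graph convolution layer of the kind such structure-aware pipelines pass their constructed graphs through is sketched in PyTorch below. How the SAG/CAG/SVG/CVG nodes and edges are actually built is the paper's contribution and is not reproduced here; the node count, feature dimensions, and fully connected toy adjacency are assumptions.

```python
import torch
import torch.nn as nn

class GraphConvLayer(nn.Module):
    """Basic graph convolution: symmetrically normalized adjacency x node features x weights."""
    def __init__(self, in_dim, out_dim):
        super().__init__()
        self.weight = nn.Linear(in_dim, out_dim, bias=False)

    def forward(self, x, adj):
        # x: (num_nodes, in_dim); adj: (num_nodes, num_nodes) with self-loops included.
        deg = adj.sum(dim=1)
        d_inv_sqrt = deg.clamp(min=1e-8).pow(-0.5)
        norm_adj = d_inv_sqrt.unsqueeze(1) * adj * d_inv_sqrt.unsqueeze(0)
        return torch.relu(norm_adj @ self.weight(x))

# Toy usage: 6 "salient region" nodes with 128-d backbone features, fully connected graph.
nodes = torch.randn(6, 128)
adj = torch.ones(6, 6)
out = GraphConvLayer(128, 64)(nodes, adj)   # structure-aware node embeddings, shape (6, 64)
```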
We introduce a machine-learning (ML)-based weather simulator--called "GraphCast"--which outperforms the most accurate deterministic operational medium-range weather forecasting system in the world, as well as all previous ML baselines. GraphCast is an autoregressive model, based on graph neural networks and a novel high-resolution multi-scale mesh representation, which we trained on historical weather data from the European Centre for Medium-Range Weather Forecasts (ECMWF)'s ERA5 reanalysis archive. It can make 10-day forecasts, at 6-hour time intervals, of five surface variables and six atmospheric variables, each at 37 vertical pressure levels, on a 0.25-degree latitude-longitude grid, which corresponds to roughly 25 x 25 kilometer resolution at the equator. Our results show GraphCast is more accurate than ECMWF's deterministic operational forecasting system, HRES, on 90.0% of the 2760 variable and lead time combinations we evaluated. GraphCast also outperforms the most accurate previous ML-based weather forecasting model on 99.2% of the 252 targets it reported. GraphCast can generate a 10-day forecast (35 gigabytes of data) in under 60 seconds on Cloud TPU v4 hardware. Unlike traditional forecasting methods, ML-based forecasting scales well with data: by training on bigger, higher quality, and more recent data, the skill of the forecasts can improve. Together these results represent a key step forward in complementing and improving weather modeling with ML, open new opportunities for fast, accurate forecasting, and help realize the promise of ML-based simulation in the physical sciences.
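GraphCast is autoregressive: a learned single-step model maps the current atmospheric state to the state six hours later, and a 10-day forecast is produced by feeding each prediction back in, i.e. 40 steps. The sketch below shows only this generic rollout loop with a toy stand-in for the learned one-step function; the channel count mirrors the variables listed above (5 surface + 6 x 37 atmospheric), but the coarse grid and the stand-in dynamics are illustrative assumptions, and the real model is a graph neural network over a multi-scale mesh.

```python
import numpy as np

def autoregressive_rollout(step_fn, initial_state, num_steps=40):
    """Roll a single-step forecast model forward by repeatedly feeding back its output.

    A 10-day forecast at 6-hour intervals corresponds to 40 autoregressive steps.
    """
    states = [initial_state]
    for _ in range(num_steps):
        states.append(step_fn(states[-1]))     # predict t + 6h from the current state
    return np.stack(states[1:])                # (num_steps, *state_shape)

# Toy stand-in for a learned one-step model (the real one is a GNN over a lat-lon mesh).
def toy_step(state):
    return 0.99 * state + 0.01 * np.roll(state, 1, axis=-1)

# 227 channels = 5 surface + 6 x 37 atmospheric variables; deliberately coarse toy grid
# instead of the real 0.25-degree 721 x 1440 grid.
init = np.random.rand(227, 36, 72).astype(np.float32)
forecast = autoregressive_rollout(toy_step, init)
print(forecast.shape)   # -> (40, 227, 36, 72)
```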
Despite the current success of multilingual pre-training, most prior works focus on leveraging monolingual data or bilingual parallel data and overlook the value of trilingual parallel data. This paper presents \textbf{Tri}angular Document-level \textbf{P}re-training (\textbf{TRIP}), which is the first in the field to extend the conventional monolingual and bilingual pre-training to a trilingual setting by (i) \textbf{Grafting} the same documents in two languages into one mixed document, and (ii) predicting the remaining language as the reference translation. Our experiments on document-level MT and cross-lingual abstractive summarization show that TRIP brings improvements of up to 3.65 d-BLEU points and 6.2 ROUGE-L points on three multilingual document-level machine translation benchmarks and one cross-lingual abstractive summarization benchmark, including multiple strong state-of-the-art (SOTA) scores. In-depth analysis indicates that TRIP improves document-level machine translation and captures better document contexts in at least three respects: (i) tense consistency, (ii) noun consistency, and (iii) conjunction presence.
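The grafting step in (i) and the prediction target in (ii) can be made concrete with a small data-construction sketch: the same document in two languages is merged into one mixed source sequence, and the third language serves as the target. The sentence-level interleaving, the <sep> token, and the language choices below are assumptions for illustration, not TRIP's exact pre-processing.

```python
from typing import List, Tuple

def build_trip_example(doc_lang_a: List[str], doc_lang_b: List[str],
                       doc_lang_c: List[str], sep: str = " <sep> ") -> Tuple[str, str]:
    """Graft the same document in two languages into one mixed source sequence,
    with the third-language document as the reference-translation target."""
    mixed = []
    for sent_a, sent_b in zip(doc_lang_a, doc_lang_b):
        mixed.append(sent_a)       # sentence in language A
        mixed.append(sent_b)       # corresponding sentence in language B
    source = sep.join(mixed)
    target = sep.join(doc_lang_c)  # language C is what the model must predict
    return source, target

# Toy document given in English and German, with French as the prediction target.
en = ["The cat sat on the mat.", "It was asleep."]
de = ["Die Katze sass auf der Matte.", "Sie schlief."]
fr = ["Le chat etait assis sur le tapis.", "Il dormait."]
src, tgt = build_trip_example(en, de, fr)
```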